feat(filesystem): add PageIndex FileSystem and PIFS CLI #302

Open
BukeLy wants to merge 83 commits into VectifyAI:main from BukeLy:feat/pageindex-filesystem

Collaborator

BukeLy commented May 26, 2026

Summary

This PR adds PageIndex FileSystem (PIFS): a filesystem-like interaction system for agents working inside a PageIndex workspace, plus a pifs CLI and an ask / chat loop built on the same command surface.

The core purpose is to help an agent quickly locate the right file in a workspace, then combine that filesystem context with PageIndex structure, metadata, and projection indexes to retrieve precise file evidence.

Goal

PIFS gives agents a stable filesystem-like interface to PageIndex workspaces:

  • inspect workspace shape with familiar folder commands;
  • locate relevant files inside an explicit path scope;
  • use PageIndex-backed structure, metadata, and semantic projections for precise evidence retrieval;
  • keep reads bounded, auditable, and tied to concrete virtual file paths.

browse is part of the file-location step. It ranks file candidates within a folder scope when folder names and exact filters are not enough. Evidence still comes from bounded cat, grep, and PageIndex structural reads.

What Changed

  • Added the PIFS core model: virtual folders, registered files, metadata status, PageIndex/projection status, path resolution, and SQLite persistence.
  • Added pifs, a shell-style CLI for workspace navigation, file discovery, metadata filtering, source reads, imports, and agent execution.
  • Added PageIndex-backed registration for PDF, Markdown, and text files, including structural reads, generated metadata, and summary projection indexing.
  • Added pifs add for atomic local imports into workspace-owned artifacts.
  • Added pifs ask and pifs chat, where the agent uses the same read-only filesystem commands available to users.
  • Added PIFS Semantic Folder as an explicit build step for flat or weakly organized corpora: pifs semantic-folder build [source_scope] materializes a generated <source_scope>/semantic tree from canonicalized domain / topic metadata.
  • Added command guardrails for bounded reads, lexical grep -R, path ambiguity, projection dimension mismatches, atomic import cleanup, and semantic-folder rebuild safety.
  • Expanded regression coverage across storage, command parsing/rendering, registration, add rollback, browse behavior, structural reads, metadata generation, semantic indexes, and semantic-folder materialization.

Command Surface

  • Global flags: --workspace, --env-file, --json
  • Workspace defaults: pifs set workspace <path>
  • Navigation and inspection: pifs ls, pifs tree, pifs find, pifs stat
  • File discovery: pifs browse [-R] <folder> "<query>" [--space summary|entity|relation] [--where JSON] [--page N]
  • Evidence reads: pifs cat <path> --structure|--page|--range|--all, pifs grep [-R] <pattern> <path>
  • Imports and generated views: pifs add <physical_path> <virtual_path>, pifs semantic-folder build [source_scope]
  • Agent loop: pifs ask "<question>", pifs chat

The agent command surface intentionally exposes only read/navigation commands: ls, tree, find, browse, grep, cat, and stat. It can use an existing semantic folder like any other tree, but it cannot build one.
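
The read-only restriction can be pictured as an allowlist check on the first token of a command line. This is only a sketch: `AGENT_COMMANDS` and `is_agent_command` are hypothetical names, not taken from pageindex/filesystem/commands.py, and the actual parser may enforce the restriction differently.

```python
import shlex

# Read/navigation commands listed in the PR description.
# The constant name is hypothetical, chosen for this illustration.
AGENT_COMMANDS = frozenset({"ls", "tree", "find", "browse", "grep", "cat", "stat"})


def is_agent_command(command_line: str) -> bool:
    """Return True if the command's first token is in the read-only allowlist."""
    try:
        tokens = shlex.split(command_line)
    except ValueError:
        # Unbalanced quotes and similar malformed input are rejected outright.
        return False
    return bool(tokens) and tokens[0] in AGENT_COMMANDS
```

Under this sketch, `browse -R /docs "revenue growth"` passes, while `semantic-folder build /docs` and `add ./a.pdf /docs/a.pdf` are refused because they mutate the workspace.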

Key Files

  • pageindex/filesystem/core.py: high-level PIFS API, registration flow, metadata generation, projection wiring, semantic-folder build orchestration, and browse behavior.
  • pageindex/filesystem/store.py: SQLite workspace catalog for folders, files, metadata, generated memberships, and PageIndex/projection state.
  • pageindex/filesystem/commands.py: command parser, executor, shell rendering, capabilities, and guardrail messages.
  • pageindex/filesystem/agent.py: ask / chat policy and streaming loop over the PIFS command surface.
  • pageindex/filesystem/semantic_folder.py: Semantic Folder planner contract, OpenAI planner, plan schema, and validation rules.
  • pageindex/filesystem/semantic_projection.py and semantic_index.py: summary projection indexing and vector search adapter used by browse.
  • pageindex/filesystem/metadata.py and metadata_generation.py: metadata schema, policy, status, and generated metadata helpers.
  • pageindex/filesystem/cli.py and pifs: CLI entrypoints.
  • examples/pifs_demo.py: local end-to-end demo over example documents.

Verification

  • uv run pytest tests/test_filesystem_store.py tests/test_import_surface.py tests/test_metadata_generation.py tests/test_pageindex_filesystem_scope.py tests/test_pageindex_structural_read.py tests/test_pifs_add_command.py tests/test_pifs_agent_stream.py tests/test_pifs_cli.py tests/test_pifs_find_maxdepth.py tests/test_pifs_like_escape.py tests/test_pifs_path_resolution.py tests/test_pifs_register_side_effects.py tests/test_pifs_semantic_folder.py tests/test_semantic_index.py
  • Manual PIFS CLI/chat demo coverage on the example workspace during development.
  • Manual Semantic Folder smoke on SEC filings: /SEC_Filings_LTM/semantic built as topic/domain with 82 files, 82 memberships, and 0 skipped files.
  • Manual Semantic Folder smoke on 33capital: /33capital/semantic built as topic/domain with 18 files, 18 memberships, and 0 skipped files.

BukeLy force-pushed the feat/pageindex-filesystem branch from 274af6c to d7d3cb8 on May 26, 2026 18:08
BukeLy added 29 commits May 27, 2026 02:12
Remove the synchronous=OFF pragma from PIFS catalog inserts so SQLite remains the durable source of truth.
Route default semantic search to the summary projection when summary is the only populated semantic channel.
Only use the fresh event loop fallback for missing running-loop detection, so RuntimeError from a threaded agent run is not retried.
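
The event-loop commit above describes a common asyncio pattern: catch `RuntimeError` only around the running-loop check, so that a `RuntimeError` raised by the work itself is never mistaken for "no loop is running" and retried. A minimal sketch of that pattern follows, assuming a hypothetical helper `run_coroutine_sync` that takes a zero-argument coroutine factory. It is not the PR's code.

```python
import asyncio


def run_coroutine_sync(coro_factory):
    """Run a coroutine to completion from synchronous code.

    Assumption: coro_factory is a zero-argument callable returning a coroutine.
    """
    try:
        # Only this detection call is guarded by the except clause.
        asyncio.get_running_loop()
        loop_running = True
    except RuntimeError:
        loop_running = False

    if loop_running:
        raise RuntimeError("run_coroutine_sync() called from a running event loop")

    # A RuntimeError raised inside the coroutine propagates to the caller
    # unchanged, because asyncio.run() sits outside the try block.
    return asyncio.run(coro_factory())
```

Keeping `asyncio.run()` outside the `try` block is the design point: the fallback reacts to the absence of a loop, not to any `RuntimeError` that happens to surface.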
BukeLy added 28 commits May 31, 2026 21:37
Merge the unified browse command implementation into feat/pageindex-filesystem.
Merge stable key-value browse output into feat/pageindex-filesystem.
Merge removal of legacy semantic commands into feat/pageindex-filesystem.
Merge ask/chat retrieval strategy updates into feat/pageindex-filesystem.
Merge embedding dimension defaults and mismatch guards into feat/pageindex-filesystem.
Merge pifs add command and atomic import handling into feat/pageindex-filesystem.
Return nested PageIndex structure JSON from cat --structure and keep content reads page-based only. Remove the cat --node command surface, related limits, prompts, and structure-text fallback.
* feat(filesystem): add pifs semantic folder build

* fix(filesystem): preserve semantic folder command paths

* fix(filesystem): retry semantic folder planning

* fix(filesystem): balance semantic folder planner guidance
Collaborator

KylinMountain left a comment

Automated deep-review summary (recall-oriented)

Machine-assisted review (multi-agent finder + adversarial verification passes) of the PIFS change, optimized for recall of real defects. Each inline finding was independently verified against the PR-head code. Severity tags: High = breaks normal usage / crashes / silent wrong results; Med = real bug with a narrower trigger.

Positioning question first (see the requirements.txt comment)

PIFS introduces sqlite-vec + an OpenAI embedding pipeline, and pifs browse performs vector similarity search, which sits in tension with the README's "Vectorless / No Vector DB" positioning (L14 / L16 / L65 / L67). Flagging for an explicit product + docs decision, not as a code bug.

Highest-impact correctness issues (inlined below)

  • Malformed command (find /docs --where) → uncaught IndexError crashes the agent turn / CLI.
  • Read-only commands (ls/cat/find...) require an embedding key once content is indexed.
  • Stopword-only query (--name "the") silently returns [].
  • One ambiguous title aborts the whole browse.
  • Metadata filters: numeric $eq int/float mismatch, $gt/$lt excludes text-stored numerics, large-int float precision loss — three distinct holes in the same area.
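
To illustrate the three metadata-filter holes named in the last bullet, here is one way numeric operands could be compared exactly, regardless of whether a value arrives as an int, a float, or a text-stored number. This is a sketch under stated assumptions: `to_exact` and `compare` are hypothetical helpers, the operator semantics are inferred from the `$eq` / `$gt` / `$lt` names, and the approach is not necessarily the one taken in the PR's fix.

```python
import math
from fractions import Fraction


def to_exact(value):
    """Convert an int, finite float, or numeric string to a Fraction, else None."""
    if isinstance(value, bool):
        return None  # bool is an int subclass; treat it as non-numeric here
    if isinstance(value, int):
        return Fraction(value)
    if isinstance(value, float):
        return Fraction(value) if math.isfinite(value) else None
    if isinstance(value, str):
        try:
            return Fraction(value.strip())
        except (ValueError, ZeroDivisionError):
            return None
    return None


def compare(op, stored, operand):
    """Evaluate $eq / $gt / $lt numerically; False if either side is non-numeric."""
    left, right = to_exact(stored), to_exact(operand)
    if left is None or right is None:
        return False
    if op == "$eq":
        return left == right
    if op == "$gt":
        return left > right
    if op == "$lt":
        return left < right
    raise ValueError(f"unsupported operator: {op}")
```

With this sketch, `compare("$eq", 2024, 2024.0)` and `compare("$gt", "10", 9)` both hold, while `compare("$eq", 9007199254740993, 9007199254740992.0)` does not, since neither side is ever rounded through a float. Note that `Fraction` also parses strings such as `"3/4"`, which a real filter might prefer to reject.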

Investigated but NOT flagged (verified safe)

SQLiteSession thread / shared-history concerns (SDK uses check_same_thread=False + per-instance in-memory DB); decode_vector dimension (index-level search/upsert already validate); __init__.py only swallows ModuleNotFoundError for the 4 optional deps with a re-raise guard; add_file readiness guards not reused on register (intentional deferred-metadata design); the folder-vs-file depth -1 asymmetry (actually correct — a file is one tree level below its folder).

Lower-severity (not inlined, for completeness)

cat --range past EOF → nonsensical start>end empty result; the pagination "next" command can point past document EOF; SQLiteVecSemanticIndex.reset() uses auto-committing executescript with no rollback (recoverable — rebuildable index); with self.connect() commits but never closes the connection (bounded by CPython GC); entity/relation channels are declared & plumbed but never populated (dormant — hidden from capabilities & rejected up front, only over-advertised in static help). The default-model change to gpt-5.4 looks intentional (retrieve_model was already gpt-5.4).
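
On the connection-lifetime item above: in Python's standard `sqlite3` module, using a connection as a context manager commits or rolls back the transaction but does not close the connection. Wrapping it in `contextlib.closing` is the usual way to get both. The sketch below shows the pattern; the `files` table and the `record_file` function are hypothetical and are not the PIFS catalog schema.

```python
import sqlite3
from contextlib import closing


def record_file(db_path: str, virtual_path: str) -> int:
    """Insert one row and return the row count, closing the connection afterwards."""
    # closing() guarantees conn.close() on exit. The inner `with conn` only
    # scopes the transaction: commit on success, rollback on exception.
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:
            conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY)")
            conn.execute(
                "INSERT OR IGNORE INTO files (path) VALUES (?)", (virtual_path,)
            )
        return conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
```

This makes cleanup deterministic instead of relying on garbage collection to release the handle.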

There are also several cleanup/DRY opportunities (OpenAI-client construction duplicated 3x; normalize_path / JSON-coercion / textwrap.shorten reimplemented; MetadataGenerator lazy-init copy-pasted 3x; per-file LLM/embedding calls that could batch) — happy to file separately if useful.

Generated with assistance; treat as input, not gospel — verify before acting.

Comment thread pageindex/filesystem/commands.py Outdated
Comment thread pageindex/filesystem/cli.py Outdated
Comment thread pageindex/filesystem/store.py
Comment thread pageindex/filesystem/core.py
Comment thread pageindex/filesystem/store.py
Comment thread pageindex/filesystem/store.py
Comment thread pageindex/filesystem/semantic_projection.py Outdated
Comment thread pageindex/filesystem/store.py
Comment thread pageindex/filesystem/semantic_projection.py
Comment thread requirements.txt
PyPDF2==3.0.1
python-dotenv==1.2.2
pyyaml==6.0.2
sqlite-vec>=0.1.9
Collaborator

[Positioning / architecture — for discussion] New hard dependency sqlite-vec + embedding-based similarity search. PIFS adds sqlite-vec and an OpenAI embedding pipeline (semantic_projection.py / semantic_index.py), and pifs browse performs vector similarity search. That's in tension with PageIndex's headline positioning — the README states "Vectorless, Reasoning-based RAG" and "No Vector DB ... instead of vector similarity search" (README L14 / L16 / L65 / L67).

Worth an explicit product call: is the semantic projection an intentional, clearly-scoped file-location aid (with reasoning-based structural retrieval still primary), and should that be documented so it doesn't contradict the "vectorless" claim? It also makes sqlite-vec a required dependency.

Collaborator Author

Not changing this in the correctness-fix commit. The code fixes above keep semantic projection scoped to PIFS browse as a ranked file-location aid; the semantic backend returns candidate document ids for catalog resolution, while evidence still comes from bounded cat/grep/stat/PageIndex reads. The README/product wording needs an explicit docs decision rather than a silent dependency edit in this bugfix pass, so I am leaving this as positioning context for the PR discussion.


2 participants